Computation and Language 7
♻ ☆ Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination
Tree of Thoughts (ToT) is a reasoning strategy for Large Language Models
(LLMs) that employs a generator to suggest reasoning steps and a discriminator
to decide which steps to implement. ToT demonstrates strong performance on
reasoning tasks, often surpassing simple methods such as Input-Output (IO)
prompting and Chain-of-Thought (CoT) reasoning. However, ToT does not
consistently outperform these simpler methods across all models, leaving
significant gaps in our understanding of the conditions under which ToT is
most beneficial. In this
paper, we analyze the roles of the generator and discriminator separately to
better understand the conditions when ToT is beneficial. We find that the
generator plays a more critical role than the discriminator in driving the
success of ToT. Scaling the generator leads to notable improvements in ToT
performance, even when using a smaller model as the discriminator, whereas
scaling the discriminator with a fixed generator yields only marginal gains.
Our results show that models across different scales exhibit comparable
discrimination capabilities, yet differ significantly in their generative
performance for ToT.
comment: Code: github.com/mainlp/tot-eval
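The generator/discriminator split described in the abstract can be made concrete with a small search loop. The sketch below is an illustrative breadth-limited ToT search, not the paper's code: `generate` and `score` are hypothetical stand-ins for LLM calls playing the generator and discriminator roles.

```python
# Minimal sketch of a Tree-of-Thoughts search loop with separate
# generator and discriminator roles. `generate` proposes candidate next
# reasoning steps; `score` ranks partial solutions; both are assumed
# callables standing in for LLM queries.

def tree_of_thoughts(problem, generate, score, breadth=3, depth=4):
    """Breadth-limited search: at each level the generator expands every
    surviving state, the discriminator scores the candidates, and only
    the top-`breadth` partial solutions survive."""
    frontier = [[]]  # each state is a list of reasoning steps so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for step in generate(problem, state):   # generator role
                candidates.append(state + [step])
        # discriminator role: keep the most promising partial solutions
        candidates.sort(key=lambda s: score(problem, s), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0] if frontier else []
```

Under this framing, the paper's finding corresponds to search quality being dominated by what `generate` proposes rather than by how precisely `score` ranks the candidates.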
♻ ☆ A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning
Tool use, planning, and feedback learning are currently three prominent
paradigms for developing Large Language Model (LLM)-based agents across various
tasks. Although numerous frameworks have been devised for each paradigm, their
intricate workflows and inconsistent taxonomy create challenges in
understanding and reviewing the frameworks across different paradigms. This
survey introduces a unified taxonomy to systematically review and discuss these
frameworks. Specifically, 1) the taxonomy defines environments/tasks, common
LLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models),
and universally applicable workflows found in prior work, and 2) it enables a
comparison of key perspectives on the implementations of LMPRs and workflow
designs across different agent paradigms and frameworks. 3) Finally, we
identify three limitations in existing workflow designs and systematically
discuss directions for future work. Resources are publicly available in our
GitHub repository: https://github.com/xinzhel/LLM-Agent-Survey.
comment: Under Review
♻ ☆ ESpeW: Robust Copyright Protection for LLM-based EaaS via Embedding-Specific Watermark
Embeddings as a Service (EaaS) is emerging as a crucial component of AI
applications. Unfortunately, EaaS is vulnerable to model extraction attacks,
highlighting the urgent need for copyright protection. Although some
preliminary works propose applying embedding watermarks to protect EaaS, recent
research reveals that these watermarks can be easily removed. Hence, it is
crucial to inject robust watermarks resistant to watermark removal attacks.
Existing watermarking methods typically inject a target embedding into
embeddings through linear interpolation when the text contains triggers.
However, this mechanism results in each watermarked embedding having the same
component, which makes the watermark easy to identify and eliminate. Motivated
by this, in this paper, we propose a novel embedding-specific watermarking
(ESpeW) mechanism to offer robust copyright protection for EaaS. Our approach
involves injecting unique, yet readily identifiable watermarks into each
embedding. Watermarks inserted by ESpeW are designed to maintain a significant
distance from one another and to avoid sharing common components, thus making
it significantly more challenging to remove the watermarks. Extensive
experiments on four popular datasets demonstrate that ESpeW can even watermark
successfully against a highly aggressive removal strategy without sacrificing
the quality of embeddings. Code is available at
https://github.com/liudan193/ESpeW.
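The contrast the abstract draws can be illustrated in a few lines. `baseline_watermark` below is the linear-interpolation scheme it criticizes: every watermarked embedding shares the same target component, which is what makes the watermark identifiable and removable. `specific_watermark` sketches the embedding-specific idea by perturbing only a small, embedding-dependent subset of dimensions; this is a simplified illustration of the principle, not the ESpeW algorithm itself.

```python
import numpy as np

def baseline_watermark(e, target, w=0.2):
    # Linear interpolation: the same `target` component appears in
    # every watermarked embedding, creating a shared, removable signal.
    v = (1 - w) * e + w * target
    return v / np.linalg.norm(v)

def specific_watermark(e, target, frac=0.1):
    # Embedding-specific variant (illustrative): choose the dimensions
    # to overwrite from the embedding itself, so each embedding carries
    # a watermark on a different support and none share a common
    # component.
    k = max(1, int(frac * e.size))
    dims = np.argsort(np.abs(e))[:k]    # e.g. smallest-magnitude dims
    v = e.copy()
    v[dims] = target[dims]              # copy only those target coords
    return v / np.linalg.norm(v)
```

Because the perturbed coordinates differ per embedding, an attacker cannot subtract a single shared direction to strip all watermarks at once, which is the robustness property the abstract targets.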
♻ ☆ Advancing Interpretability in Text Classification through Prototype Learning
Deep neural networks have achieved remarkable performance in various
text-based tasks but often lack interpretability, making them less suitable for
applications where transparency is critical. To address this, we propose
ProtoLens, a novel prototype-based model that provides fine-grained,
sub-sentence level interpretability for text classification. ProtoLens uses a
Prototype-aware Span Extraction module to identify relevant text spans
associated with learned prototypes and a Prototype Alignment mechanism to
ensure prototypes are semantically meaningful throughout training. By aligning
the prototype embeddings with human-understandable examples, ProtoLens provides
interpretable predictions while maintaining competitive accuracy. Extensive
experiments demonstrate that ProtoLens outperforms both prototype-based and
non-interpretable baselines on multiple text classification benchmarks. Code
and data are available at
\url{https://anonymous.4open.science/r/ProtoLens-CE0B/}.
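The prototype mechanism the abstract describes (spans compared against learned prototypes, with similarities pooled into a prediction) follows a standard pattern that can be sketched briefly. Names, the RBF similarity, and the pooling choice below are illustrative assumptions, not ProtoLens's exact architecture.

```python
import numpy as np

def prototype_logits(span_embs, prototypes, proto_to_class, n_classes,
                     gamma=1.0):
    """Prototype-based classification sketch.

    span_embs: (S, d) embeddings of extracted text spans
    prototypes: (P, d) learned prototype vectors
    proto_to_class: length-P sequence mapping each prototype to a class
    """
    # RBF similarity between every span and every prototype: (S, P)
    d2 = ((span_embs[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-gamma * d2)
    # Max-pool over spans: how strongly each prototype fires anywhere
    # in the input. The firing span is what makes the prediction
    # interpretable at the sub-sentence level.
    proto_act = sim.max(axis=0)                      # (P,)
    logits = np.zeros(n_classes)
    for p, c in enumerate(proto_to_class):           # aggregate per class
        logits[c] += proto_act[p]
    return logits
```

The interpretability claim rests on the same quantities the prediction uses: for each class one can report which prototype fired and which span activated it.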
♻ ☆ Interaction Between Humanoid Robots: Developing Autonomous Collaboration and Communication
Moraes Pablo, Rodríguez Mónica, Peters Christopher, Sodre Hiago, Mazondo Ahilen, Sandin Vincent, Barcelona Sebastian, Moraes William, Fernández Santiago, Assunção Nathalie, de Vargas Bruna, Dörnbach Tobias, Kelbouscas André, Grando Ricardo
This study investigates the interaction between humanoid robots NAO and
Pepper, emphasizing their potential applications in educational settings. NAO,
widely used in education, and Pepper, designed for social interactions, offer
new opportunities for autonomous communication and collaboration. Through a
series of programmed interactions, the robots demonstrated their ability to
communicate and coordinate actions autonomously, highlighting their potential
as tools for enhancing learning environments. The research also explores the
integration of emerging technologies, such as artificial intelligence, into
these systems, allowing robots to learn from each other and adapt their
behavior. The findings suggest that NAO and Pepper can significantly contribute
to both technical learning and the development of social and emotional skills
in students, offering innovative pedagogical approaches through the use of
humanoid robotics.
comment: in Portuguese language
♻ ☆ Evaluating AI-Generated Essays with GRE Analytical Writing Assessment
The recent revolutionary advance in generative AI enables the generation of
realistic and coherent texts by large language models (LLMs). Despite many
existing evaluation metrics on the quality of the generated texts, there is
still a lack of rigorous assessment of how well LLMs perform in complex and
demanding writing assessments. This study examines essays generated by ten
leading LLMs for the analytical writing assessment of the Graduate Record Exam
(GRE). We assessed these essays using both human raters and the e-rater
automated scoring engine as used in the GRE scoring pipeline. Notably, the
top-performing Gemini and GPT-4o received average scores of 4.78 and 4.67,
respectively, falling between "generally thoughtful, well-developed analysis of
the issue and conveys meaning clearly" and "presents a competent analysis of
the issue and conveys meaning with acceptable clarity" according to the GRE
scoring guideline. We also evaluated the detection accuracy of these essays,
with detectors trained on essays generated by the same and different LLMs.
comment: 20 pages, 6 figures
♻ ☆ BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O. Arik, Danqi Chen, Tao Yu
Existing retrieval benchmarks primarily consist of information-seeking
queries (e.g., aggregated questions from search engines) where keyword or
semantic-based retrieval is usually sufficient. However, many complex
real-world queries require in-depth reasoning to identify relevant documents
that go beyond surface form matching. For example, finding documentation for a
coding question requires understanding the logic and syntax of the functions
involved. To better benchmark retrieval on such challenging queries, we
introduce BRIGHT, the first text retrieval benchmark that requires intensive
reasoning to retrieve relevant documents. Our dataset consists of 1,384
real-world queries spanning diverse domains, such as economics, psychology,
mathematics, and coding. These queries are drawn from naturally occurring and
carefully curated human data. Extensive evaluation reveals that even
state-of-the-art retrieval models perform poorly on BRIGHT. The leading model
on the MTEB leaderboard (Muennighoff et al., 2023), which achieves an nDCG@10
of 59.0, reaches only 18.3 nDCG@10 on BRIGHT. We show that
incorporating explicit reasoning about the query improves retrieval performance
by up to 12.2 points. Moreover, incorporating retrieved documents from the
top-performing retriever boosts question-answering performance by over 6.6
points. We believe that BRIGHT paves the way for future research on retrieval
systems in more realistic and challenging settings.
comment: 48 pages
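The nDCG@10 numbers quoted in the BRIGHT abstract follow the textbook definition of normalized discounted cumulative gain; the sketch below shows that computation (this is the standard metric, not BRIGHT's evaluation code).

```python
import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_relevance, k=10):
    """ranked_relevance: relevance grades in the order the retrieval
    system returned the documents. The ideal ranking sorts the same
    grades in descending order; nDCG is the ratio of the two DCGs."""
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevance, k) / idcg if idcg > 0 else 0.0
```

A score of 18.3 nDCG@10 (i.e., 0.183) thus means the top-10 rankings recover only a small fraction of the gain an ideal ordering of the relevant documents would achieve.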
